{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Author:__ Bram Van de Sande\n",
    "\n",
    "__Date:__ 2 FEB 2018\n",
    " \n",
    "__Outline__ This notebook assesses the read performance of two different formats for storing whole genome rankings.\n",
    "\n",
    "1. The legacy format based on SQLite3 using schema defined as:\n",
    "```\n",
    "CREATE TABLE rankings (geneID VARCHAR(255), ranking BLOB);\n",
    "CREATE TABLE motifs (motifName VARCHAR(255), idx INTEGER);\n",
    "```\n",
    "The ranking of a gene for all regulatory features for which it was scored and ranked is stored as a BLOB.\n",
    "2. The new feather format which need to be installed by:\n",
    "```\n",
    "pip install pyarrow\n",
    "```\n",
    "and provides a fast and minimal API to store and read pandas' ``DataFrame``s. This format also allows to read only a subset of the columns of the dataframe stored.\n",
    "\n",
    "Cave: The HDF5 fileformat interfaced through the pandas framework via the PyTables packages is not used because of the current 2000 columns limit (https://stackoverflow.com/questions/16639503/unable-to-save-dataframe-to-hdf5-object-header-message-is-too-large)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import glob, os\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from pyscenic.rnkdb import SQLiteRankingDatabase, FeatherRankingDatabase, convert2feather\n",
    "from pyscenic.genesig import GeneSignature"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "DB_FOLDER = \"/Users/bramvandesande/Projects/lcb/databases/\"\n",
    "RESOURCES_FOLDER = \"../resources/\"\n",
    "NOMENCLATURE = \"HGNC\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the performance assessment existing human whole genome rankings stored in the legacy format are going to be used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def derive_db_name(fname):\n",
    "    return os.path.splitext(os.path.basename(fname))[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[SQLiteRankingDatabase(name=\"hg19-tss-centered-5kb-10species.mc9nr\",nomenclature=HGNC),\n",
       " SQLiteRankingDatabase(name=\"hg19-500bp-upstream-10species.mc9nr\",nomenclature=HGNC),\n",
       " SQLiteRankingDatabase(name=\"hg19-tss-centered-10kb-7species.mc9nr\",nomenclature=HGNC),\n",
       " SQLiteRankingDatabase(name=\"hg19-500bp-upstream-7species.mc9nr\",nomenclature=HGNC),\n",
       " SQLiteRankingDatabase(name=\"hg19-tss-centered-5kb-7species.mc9nr\",nomenclature=HGNC),\n",
       " SQLiteRankingDatabase(name=\"hg19-tss-centered-10kb-10species.mc9nr\",nomenclature=HGNC)]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "DB_GLOB = os.path.join(DB_FOLDER,\"hg19*.db\")\n",
    "sqlitedb_fnames = glob.glob(DB_GLOB)\n",
    "sqlitedbs = [SQLiteRankingDatabase(fname=fname, name=derive_db_name(fname), nomenclature=NOMENCLATURE) for fname in sqlitedb_fnames]\n",
    "sqlitedbs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These rankings need to be converted to the new feather format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "feather_fnames = [convert2feather(fname, DB_FOLDER, derive_db_name(fname), NOMENCLATURE) for fname in sqlitedb_fnames]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[FeatherRankingDatabase(name=\"hg19-tss-centered-5kb-10species.mc9nr\",nomenclature=HGNC),\n",
       " FeatherRankingDatabase(name=\"hg19-500bp-upstream-10species.mc9nr\",nomenclature=HGNC),\n",
       " FeatherRankingDatabase(name=\"hg19-tss-centered-10kb-7species.mc9nr\",nomenclature=HGNC),\n",
       " FeatherRankingDatabase(name=\"hg19-500bp-upstream-7species.mc9nr\",nomenclature=HGNC),\n",
       " FeatherRankingDatabase(name=\"hg19-tss-centered-5kb-7species.mc9nr\",nomenclature=HGNC),\n",
       " FeatherRankingDatabase(name=\"hg19-tss-centered-10kb-10species.mc9nr\",nomenclature=HGNC)]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "featherdbs = [FeatherRankingDatabase(fname, derive_db_name(fname), NOMENCLATURE) for fname in feather_fnames]\n",
    "featherdbs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The size on disk is similar for both fileformats"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[1.099437056, 1.099437056, 1.099437056, 1.099437056, 1.099437056, 1.099437056]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fsizes = [float(os.path.getsize(fname))/1e9 for fname in sqlitedb_fnames]\n",
    "fsizes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[1.092309888, 1.092309888, 1.092309888, 1.092309888, 1.092309888, 1.092309888]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fsizes = [float(os.path.getsize(fname))/1e9 for fname in feather_fnames]\n",
    "fsizes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The gene signatures used in this performance assessement are downloaded from MSigDB (http://software.broadinstitute.org/gsea/msigdb). The module C6 is used in this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "GMT_FNAME = os.path.join(RESOURCES_FOLDER,\"c6.all.v6.1.symbols.gmt.txt\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "189"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "msigdb_c6 = GeneSignature.from_gmt(\n",
    "                        fname=GMT_FNAME,\n",
    "                        nomenclature=\"HGNC\",\n",
    "                        gene_separator=\"\\t\",\n",
    "                        field_separator=\"\\t\")\n",
    "len(msigdb_c6)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The compare the read performance every signature in the MSigDB module is loaded from all human ranking databases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "def print_report(res):\n",
    "    seconds = res.all_runs[0]\n",
    "    n_dbs = len(featherdbs)\n",
    "    n_signatures = len(msigdb_c6)\n",
    "    print(\"{}ms per load\".format((seconds/(n_dbs*n_signatures))*1000.0))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<TimeitResult : 1min 3s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%timeit -n1 -r1 -o -q\n",
    "for gs in msigdb_c6:\n",
    "    for db in featherdbs:\n",
    "        db.load(gs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "56.41426802380832ms per load\n"
     ]
    }
   ],
   "source": [
    "print_report(_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<TimeitResult : 1min 36s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)>"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%timeit -n1 -r1 -o -q\n",
    "for gs in msigdb_c6:\n",
    "    for db in sqlitedbs:\n",
    "        db.load(gs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "85.47179554585371ms per load\n"
     ]
    }
   ],
   "source": [
    "print_report(_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Conclusion:__ The new feather format has several advantages over the legacy format:\n",
    "- Faster read performance on average.\n",
    "- A far easier implementation relying entirely on an external python package.\n",
    "- Interoperability between R and python.\n",
    "- Reading entire dataframes into memory is significantly faster with the feather format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.6.4 (pyscenic_dev)",
   "language": "python",
   "name": "pyscenic_dev"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
